
    A Web-Based Medical Text Simplification Tool

    With the increasing demand for improved health literacy, better tools are needed to efficiently produce personalized health information that patients can read and understand. In this paper, we introduce a web-based text simplification tool that helps content producers simplify existing text materials to make them more broadly accessible. The tool offers concrete suggestions based on features that have each been shown in previous research to improve the understandability of text. We provide an overview of the tool along with a quantitative analysis of its impact on medical texts. On a medical corpus, the tool provides good coverage, with suggestions on over a third of the words and over a third of the sentences. These suggestions are over 40% accurate, though the accuracy varies by text source.
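    As a rough illustration of the kind of word-level suggestions such a tool can surface, the sketch below flags difficult medical terms and proposes plainer alternatives. The lexicon and function names here are hypothetical stand-ins, not the paper's actual resources.

```python
# A minimal sketch of word-level simplification suggestions, loosely inspired
# by the tool described above. The lexicon below is a hypothetical stand-in.
import re

# Hypothetical mapping from difficult medical terms to plainer synonyms.
SIMPLER = {
    "hypertension": "high blood pressure",
    "myocardial infarction": "heart attack",
    "analgesic": "pain reliever",
}

def suggest_simplifications(text: str) -> list[tuple[str, str]]:
    """Return (original, simpler) suggestion pairs found in the text."""
    suggestions = []
    lowered = text.lower()
    for term, simple in SIMPLER.items():
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            suggestions.append((term, simple))
    return suggestions

print(suggest_simplifications(
    "Patients with hypertension should avoid this analgesic."
))
# [('hypertension', 'high blood pressure'), ('analgesic', 'pain reliever')]
```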

    Modeling Word Burstiness Using the Dirichlet Distribution

    Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial (DCM) model as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, as measured by perplexity. We also show, using three standard document collections, that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
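    The DCM (also known as the Dirichlet-multinomial or Pólya distribution) has a closed-form likelihood, which the sketch below evaluates for a document's word-count vector. The parameter values are illustrative only, not taken from the paper.

```python
# Log-likelihood of a word-count vector under the Dirichlet compound
# multinomial (DCM), using the standard closed form of the Polya distribution.
import numpy as np
from scipy.special import gammaln

def dcm_log_likelihood(counts: np.ndarray, alpha: np.ndarray) -> float:
    """log p(counts | alpha) under the DCM / Polya distribution."""
    n = counts.sum()
    a = alpha.sum()
    # multinomial coefficient: log n! - sum_w log x_w!
    coef = gammaln(n + 1) - gammaln(counts + 1).sum()
    # Dirichlet-multinomial integral in closed form
    return coef + gammaln(a) - gammaln(n + a) \
         + (gammaln(counts + alpha) - gammaln(alpha)).sum()

counts = np.array([3, 0, 1])           # word counts for a 3-word vocabulary
alpha = np.array([0.1, 0.1, 0.1])      # small alphas allow bursty behavior
print(dcm_log_likelihood(counts, alpha))
```

    Small alpha values concentrate probability mass on repeated occurrences of the same word, which is exactly the burstiness the multinomial cannot express.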

    Data-driven sentence simplification: Survey and benchmark

    Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. To do so, several rewriting transformations can be performed, such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output is a challenging and still far-from-solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified sentence pairs in English, which is the dominant paradigm nowadays. We also include a benchmark of different approaches on common datasets so as to compare them and highlight their strengths and limitations. We expect that this survey will serve as a starting point for researchers interested in the task and help spark new ideas for future developments.
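    As a toy illustration of the splitting transformation mentioned above, a single hand-written rule might look like the following. Real data-driven systems learn such rewrites from aligned corpora; this rule is purely hypothetical.

```python
# A toy example of one SS transformation: sentence splitting at a
# coordinating conjunction. Hand-written and illustrative only.
def split_on_conjunction(sentence: str) -> list[str]:
    """Split a sentence at ', and ' into two shorter sentences."""
    if ", and " in sentence:
        left, right = sentence.split(", and ", 1)
        return [left.rstrip(".") + ".", right[0].upper() + right[1:]]
    return [sentence]

print(split_on_conjunction(
    "The trial enrolled 200 patients, and half received the new drug."
))
# ['The trial enrolled 200 patients.', 'Half received the new drug.']
```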


    Feature-based segmentation of narrative documents

    In this paper we examine topic segmentation of narrative documents, which are characterized by long passages of text with few headings. We first present results suggesting that previous topic segmentation approaches are not appropriate for narrative text. We then present a feature-based method that combines features from diverse sources as well as learned features. Applied to narrative books and encyclopedia articles, our method shows results that are significantly better than previous segmentation approaches. An analysis of individual features is also provided, and the benefit of generalization using outside resources is shown.
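    As a minimal sketch of the feature-based view of segmentation, the code below scores candidate sentence boundaries with a single lexical-cohesion feature and thresholds it. The paper's method combines many richer and learned features, so this is illustrative only; the window and threshold values are arbitrary.

```python
# Bare-bones boundary detection: propose a topic boundary where lexical
# overlap between adjacent blocks of sentences drops below a threshold.
def overlap_score(left: list[str], right: list[str]) -> float:
    """Jaccard overlap between the word sets of two text blocks."""
    a, b = set(left), set(right)
    return len(a & b) / len(a | b) if a | b else 0.0

def boundaries(sentences: list[list[str]], window: int = 2,
               thresh: float = 0.1) -> list[int]:
    """Propose a boundary before sentence i when cohesion drops below thresh."""
    cuts = []
    for i in range(window, len(sentences) - window):
        left = [w for s in sentences[i - window:i] for w in s]
        right = [w for s in sentences[i:i + window] for w in s]
        if overlap_score(left, right) < thresh:
            cuts.append(i)
    return cuts

docs = [["the", "drug", "trial"], ["the", "trial", "results"],
        ["rome", "has", "ruins"], ["ruins", "draw", "tourists"]]
print(boundaries(docs, window=1))   # [2]: cohesion drops at the topic shift
```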

    Contributions to research on machine translation

    In the past few decades machine translation research has made major progress. A researcher now has access to many systems, both commercial and research, of varying levels of performance. In this thesis, we describe different methods that leverage these pre-existing systems as tools for research in machine translation and related fields.

    We first examine techniques for improving a translation system using additional text. The first method uses a monolingual corpus. Discrepancies are identified by translating a word list to a foreign language and back again. Entries where the original word and its double translation differ are used to learn word-level correction rules. The second method uses parallel bilingual data consisting of source language/target language sentence pairs. The source sentences are translated using a translation system, and a partial alignment is identified between the machine-translated sentences and the corresponding human-translated sentences in the target language. This alignment is used to generate phrase-level correction rules. Experimentally, both word-level and phrase-level correction rules result in improved translation performance. The learned word-level correction rules make 24,235 corrections on 20,000 Spanish-to-English translated sentences, with high accuracy. The learned phrase-level rules improve the translation performance (as measured by BLEU) of a French-to-English commercial system by 30%, and of a state-of-the-art phrase-based system in a statistically significant way.

    To train current statistical machine translation systems, bilingual examples of parallel sentences are used. Generating this data is costly, and currently feasible only in limited domains and languages. A fundamental question is whether every potential example is equally useful. We describe a ranking method that scores individual sentence pairs based on the performance of translation systems trained on random subsets of the examples. When used to train a translation system, the top-ranking examples result in a significantly better-performing system than a random selection of examples. Given these ranked examples, a model of example usefulness can potentially be learned to select the most useful unlabeled examples. Initial experiments show that two previously used example features are good candidates for identifying useful examples.

    In the last part of this thesis we describe how automatic paraphrasing methods can be used to improve the accuracy of evaluation measures for machine translation. Given a human-generated reference sentence and a machine-generated translated sentence, we present a method that finds a paraphrase of the reference sentence that is closer in wording to the machine output than the original reference is. We show that using paraphrased reference sentences for evaluating translation system output results in better correlation with human judgement of translation adequacy than using the original reference sentences.
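    The round-trip idea behind the word-level correction rules can be sketched as follows. Here `to_foreign` and `to_english` stand in for calls to a real translation system, and the toy dictionaries are purely hypothetical.

```python
# Schematic sketch of the round-trip technique: translate each word to the
# foreign language and back, and flag entries where the double translation
# differs from the original word.
def find_discrepancies(words, to_foreign, to_english):
    """Return (word, round_trip) pairs where the double translation differs."""
    pairs = []
    for w in words:
        round_trip = to_english(to_foreign(w))
        if round_trip != w:
            pairs.append((w, round_trip))  # candidate for a correction rule
    return pairs

# Toy stand-in translators for demonstration only.
fake_es = {"doctor": "médico", "nurse": "enfermera"}
fake_en = {"médico": "physician", "enfermera": "nurse"}
print(find_discrepancies(
    ["doctor", "nurse"],
    lambda w: fake_es.get(w, w),
    lambda w: fake_en.get(w, w),
))
# [('doctor', 'physician')]  -> evidence for a word-level correction rule
```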

    Effect of Boosting in BWI

    Recent work in information extraction has brought about a new method for text extraction using wrappers. A wrapper is a simple, but highly accurate, extraction procedure. Unfortunately, these wrappers tend to have low recall. To remedy this problem, boosted wrapper induction (BWI) was proposed. This method combines a weak wrapper learner with AdaBoost to generate a more general extraction rule. The result is an algorithm with a bias towards precision, but with reasonable recall in both structured and natural text domains. The exact benefit of boosting over more traditional approaches is not always apparent. In this paper, we examine the benefits of boosting by comparing BWI to two different sequential covering algorithms with wrappers for text extraction in the framework of both highly structured and natural text. Sequential covering is a simple, straightforward algorithm that tries to cover as many positive examples as possible with a single rule, removes the covered examples from the training set, and continues until all of the positive examples have been covered; a sketch of this loop appears below. We present results from a broad range of information extraction tasks and show that the basic benefit of boosting in this domain is to allow BWI to continue learning new and helpful rules, without overfitting the training data, even after all of the positive examples are covered. This result is consistent with previous theoretical and experimental results.
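    A compact sketch of the sequential-covering loop described above, assuming a hypothetical `learn_one_rule` placeholder for any single-rule learner such as a wrapper inducer:

```python
# Sequential covering: repeatedly learn one rule, keep it, and remove the
# positive examples it covers, until every positive example is covered.
def sequential_covering(positives, negatives, learn_one_rule):
    """Learn rules until every positive example is covered."""
    rules = []
    remaining = list(positives)
    while remaining:
        rule = learn_one_rule(remaining, negatives)
        covered = [x for x in remaining if rule(x)]
        if not covered:          # learner made no progress; stop early
            break
        rules.append(rule)
        remaining = [x for x in remaining if not rule(x)]
    return rules

# Toy usage: rules are predicates; this "learner" keys on the first letter.
learn = lambda pos, neg: (lambda x, t=pos[0][0]: x.startswith(t))
print(len(sequential_covering(["apple", "ant", "bee"], [], learn)))  # 2
```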